Spec 038 Phase 3 — Class-A rules, correctness fixes, EC3 results, template typo fix #250
Three new Class-A induced rules, motivated by the cross-agent audit at
docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md:
- GridSizeFactoryParensRule (CS1955; 146 events combined gpt-5.5 +
sonnet-4.6; first cross-tier rule since CS1955 is outside Tier-2
SupportedCodes): GridSize.Auto() -> GridSize.Auto (drop the parens).
- GridSizePxRenameRule (CS0117; 9 cross-agent events): GridSize.Pixel /
Pixels / Fixed -> GridSize.Px (WPF/WinUI legacy name -> Reactor's Px).
- TextBlockStyleHintRule (CS1061/CS0117; 5 cross-agent events across
both .Style(...) and `with { Style = ... }` shapes): hint toward
Reactor's fluent text helpers since the element exposes no Style.
ThemeBackgroundSuffixRule reclassified Class-B -> Class-A (paperwork
only; cross-agent audit shows 27 events on the same key).
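As a rough illustration of the first two rewrites, here is a minimal string-level sketch. It is not the rules' implementation: the real rules bind symbols against the compilation, and the helper names `SuggestPxRename` and `SuggestDropParens` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical string-level stand-ins for two of the rewrites described above.
// The real rules resolve symbols against the compilation; nothing here does.
var legacyPxNames = new HashSet<string>(StringComparer.Ordinal) { "Pixel", "Pixels", "Fixed" };

// CS0117 shape: GridSize.Pixel / Pixels / Fixed -> GridSize.Px
string? SuggestPxRename(string memberName) =>
    legacyPxNames.Contains(memberName) ? "Px" : null;

// CS1955 shape: GridSize.Auto() -> GridSize.Auto (drop the empty parens)
string? SuggestDropParens(string expression) =>
    expression.StartsWith("GridSize.", StringComparison.Ordinal) &&
    expression.EndsWith("()", StringComparison.Ordinal)
        ? expression[..^2]
        : null;

Console.WriteLine(SuggestPxRename("Pixels") ?? "(no suggestion)");
Console.WriteLine(SuggestDropParens("GridSize.Auto()") ?? "(no suggestion)");
```

Both helpers return null when the input does not match, which mirrors the idea that a rule stays silent unless its pattern applies.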
Two critical correctness fixes uncovered by end-to-end smoke testing —
both blocked any real-world rule firing before this commit:
1. CompilationLoader.ResolveReferences now walks libraries.<id> entries
with type=project in project.assets.json and locates the most-recently-built
matching .dll under that project's bin/ tree. Without this
every rule's DeclaredTargets failed to resolve and the whole registry
self-disabled on real mur check invocations (unit tests passed because
they use synthetic in-memory compilations). Regression locked by
CompilationLoaderTests.Resolves_ProjectReference_built_dll_from_project_assets_json.
2. SuggesterOrchestrator gains a tier2Enabled bool; CheckCommand.Run
always builds the orchestrator (when the compilation loads) and passes
the suggest-gate result in as tier2Enabled. Tier-3 rules always run
when their diagnostic code surfaces; Tier-2 stays gated on small
builds where its fuzzy match has near-0% precision (525-run
calibration). This is the EC2 watch-item ("Phase-3 rules are the
right lever — not Phase-2.x gate tuning") finally addressed in code.
Two new orchestrator tests lock down both halves of the carve-out.
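The ProjectReference walk in fix 1 can be sketched roughly as follows. This is a sketch under stated assumptions, not CompilationLoader's code: the helper names `ProjectLibraryIds` and `NewestBuiltDll` are hypothetical, the caller is assumed to already know the referenced project's directory, and "most recently built" is approximated by last write time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Ids of "libraries" entries whose "type" is "project" in project.assets.json.
// Keys look like "<id>/<version>"; only the id part is returned.
List<string> ProjectLibraryIds(string assetsJson)
{
    using var doc = JsonDocument.Parse(assetsJson);
    var ids = new List<string>();
    if (!doc.RootElement.TryGetProperty("libraries", out var libraries)) return ids;
    foreach (var lib in libraries.EnumerateObject())
    {
        if (lib.Value.TryGetProperty("type", out var type) && type.GetString() == "project")
            ids.Add(lib.Name.Split('/')[0]);
    }
    return ids;
}

// Most recently written <id>.dll under the project's bin/ tree, or null if none.
// Assumption: projectDir is supplied by the caller.
string? NewestBuiltDll(string projectDir, string id)
{
    var bin = Path.Combine(projectDir, "bin");
    if (!Directory.Exists(bin)) return null;
    return Directory.EnumerateFiles(bin, id + ".dll", SearchOption.AllDirectories)
        .OrderByDescending(f => File.GetLastWriteTimeUtc(f))
        .FirstOrDefault();
}

var sample = "{\"libraries\":{\"Reactor/1.0.0\":{\"type\":\"project\"},\"Some.Package/2.0.0\":{\"type\":\"package\"}}}";
Console.WriteLine(string.Join(", ", ProjectLibraryIds(sample)));
```

Synthetic in-memory compilations never go through a path like this, which is consistent with the unit tests passing while real invocations found no Reactor symbols.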
§3.1a per-rule performance bound test landed (was deferred until first
rule shipped): RulePerformanceTests.BestMatch_median_under_per_rule_budget
asserts that the symbol-resolution + TryMatch median stays <= 0.5 ms per
rule per diagnostic, multiplied by a 4x CI slack factor.
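A minimal sketch of a median-under-budget check of this kind. It assumes the budget combines as 0.5 ms x rule count x 4; the helper names and the timed workload are hypothetical and are not taken from RulePerformanceTests.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Median of a sample set; averages the two middle values for even counts.
double Median(double[] samples)
{
    var sorted = samples.OrderBy(x => x).ToArray();
    int mid = sorted.Length / 2;
    return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
}

// Assumed budget shape: per-rule allowance times rule count times CI slack.
double BudgetMs(int ruleCount, double perRuleMs = 0.5, double ciSlack = 4.0) =>
    perRuleMs * ruleCount * ciSlack;

// Wall-clock time of each iteration, in milliseconds.
double[] TimeMs(Action work, int iterations)
{
    var samples = new double[iterations];
    for (int i = 0; i < iterations; i++)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        samples[i] = sw.Elapsed.TotalMilliseconds;
    }
    return samples;
}

// Placeholder workload standing in for "resolve symbols + TryMatch".
var timings = TimeMs(() => Math.Sqrt(2.0), iterations: 101);
Console.WriteLine(Median(timings) <= BudgetMs(ruleCount: 6) ? "within budget" : "over budget");
```

Using the median rather than the mean keeps a single slow iteration (JIT warm-up, a GC pause) from failing the check on a noisy CI machine.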
Status snapshot in the implementation tasks doc updated to record the
sonnet-4.6 corpus aggregation (368 fixes / 564 ranker rows / 41 clusters),
the cross-agent audit verdicts (3 STRONG Class-A targets, plus
TemplatedListView family that's STRONG-after-generalization-over-<T>,
plus the gpt-5.5-only CS1955/GridElement family deferred to a third
corpus drop), and the rule-PR queue with this commit's three Class-A
rules marked authored.
Branch is for spec 038 EC3 eval — see C:\temp\mur-ec3-handoff.md.
Full Reactor.Tests suite: 7175 passing / 46 expected skips.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reference doc was scoped to Phase 0+1 plus the suggest-gate. This
update extends it to cover everything shipping since:
- Phase 2 (merged): MSBuild passthrough via `--`, mode flags
(--strict/--final/--quiet/--emit-threshold), the deterministic
pre-emit policy table, and the suppress-to-error guardrail tool.
- Phase 3 (in flight on this branch): the IRulePattern infrastructure,
RuleSymbolResolver / RuleRegistry, --disable-rule + --list-rules CLI
surface, six authored rules (three Class-A induced + three Class-B
vocabulary), the symbol-binding contract from §3.1a, and the
per-rule perf bound test.
- Two critical correctness fixes uncovered during Phase 3 end-to-end
smoke testing: CompilationLoader's ProjectReference resolution path
and the suggest-gate carve-out for Tier-3 rules. Both get their own
subsections in §3 explaining why unit tests passed while production
silently no-op'd, since that failure mode generalises beyond this
spec.
- The cross-agent mining drop (`claude-sonnet-4.6` × 525 runs) and the
audit it produced. New subsection in §4 on comparing models to
separate structural vocabulary-confusion signals from agent-specific
idiosyncrasies; new subsection in §5 on what the second-agent corpus
changed (B->A promotions, single-corpus deferrals, cross-syntactic-
shape rule emergence).
§9 (Future improvements) tightened to what's actually left: remainder
of Phase 3 (more rules pending a third-agent corpus + Class-B catalog
expansion), Phase 4 (telemetry + learned ranker, blocked on Data
Checkpoint D), and a "what EC3 will tell us" subsection that frames
EC3 as a fresh measurement rather than an incremental delta on EC2.
Glossary gains: rule carve-out, pre-emit ranker, symbol-binding
contract, ProjectReference resolution, cross-agent audit, provenance.
TOC updated for the §8 rename ("in this PR" -> "so far").
Tone matches the existing doc: plain language first, then engineer
detail, then ML-practitioner detail. The same explanatory pattern
spec 038's design doc uses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PASS-with-caveats. Calc cleared the >=5% token-improvement bar (-5.2% mean, -13.0% median); kanban regressed (+14.9% mean, +60.7% median). The three Class-A rules added in this branch fired zero times across all 10 variant runs, so the EC3 delta did not exercise them. The calc improvement is plausibly driven by the CompilationLoader + gate carve-out fixes letting rules run at all, not by the new rules.
Tool-call profile diff identifies the +3.2 turn delta on kanban: the variant agent does ~+1 skill load, +1 view, +1 apply_patch per run vs base, consistent with a "verify-before-edit" loop triggered by rule suggestions. Mechanism cited in handoff section 7.
Recommend: do not declare Phase 3 cleared on this batch alone. Re-run with prompts that target GridSize/TextBlock patterns to get Class-A rule exercise; investigate the kanban-base R1 outlier (1.12M tokens, 3.4x median) before reading the kanban regression as decisive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edit framing
Per-call inspection of variant kanban rg patterns and view paths doesn't support the earlier "verify-before-edit" hypothesis: ~11/12 rg calls probe drag/drop and modifier APIs unrelated to mur output; view calls are mostly the agent re-reading its own in-progress workspace files.
The two rule-fired runs (r1=Theme, r4=Align) are middle-of-the-pack on turns and tools, not the heaviest. The variant mean is dragged up by r5 (20 turns, 27 tool calls, 889K tokens, zero rule fires), which looks like a generic long-tail trajectory comparable to base R1.
Reframing: rule fires correlate with normal token usage when they happen; mur check can't help on builds where the agent's mistakes fall outside the rule set's coverage. The kanban-prompt -> rule-coverage gap is the underlying issue, not rule-induced verification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-character bug in tools/Templates/templates/WinUIApp-CSharp/
.template.config/template.json lines 5-6: the template's identity and
groupIdentity were "Micrsoft.UI.Reactor.CSharp" / "Micrsoft.UI.Reactor"
(missing the second 'o'). Checked into the repo since at least Phase 1.
How it surfaced. The eval harness runs `dotnet new install ... --force`
on every setup. Multiple installs accumulate duplicate entries in the
user's ~/.templateengine/dotnetcli/<sdk>/templatecache.json under the
misspelled identity. The duplicate-match condition makes
`dotnet new reactorapp` resolve more than one template for the
"reactorapp" short name, throwing "Sequence contains more than one
matching element" with exit code 70. The EC3 5x2 batch hit this 20/20
runs across both arms — the spec doc's earlier framing ("agent typo,
at least one variant kanban run, didn't block the build") was wrong on
three counts; corrected in this commit.
Why the existing integration test (CreateTemplateTests) didn't catch
the typo: it installs the template into a per-test ephemeral
--debug:custom-hive, where the misspelled identity is the only entry
and `dotnet new` resolves correctly. The bug only surfaces against the
user's real (accumulating) cache. The new test (described below) is
content validation, not install/run behavior — orthogonal coverage
that catches the typo regardless of cache state.
Test added: tests/Reactor.Tests/TemplateMetadataTests.cs. Four xUnit
[Fact]s that load template.json directly:
- Identity_is_canonical_brand_namespace: exact-match assertion
against "Microsoft.UI.Reactor.CSharp".
- GroupIdentity_is_canonical_brand_namespace: exact-match against
"Microsoft.UI.Reactor".
- File_contains_no_brand_typos: substring sweep for "Micrsoft"
anywhere in the file (belt-and-suspenders catch for future typos
in any new symbol/description/etc.).
- ShortName_resolves_to_reactorapp: anchors the public CLI command
name documented in SKILL.md and the wordpuzzle smoke pattern.
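The four content checks could look roughly like this when written without a test framework. This is a sketch, not the contents of TemplateMetadataTests.cs: `ValidateTemplateJson` is a hypothetical helper, and it assumes `shortName` is stored as a plain string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Returns the names of the checks that fail for a template.json payload.
List<string> ValidateTemplateJson(string json)
{
    var problems = new List<string>();
    using var doc = JsonDocument.Parse(json);
    var root = doc.RootElement;

    // Reads a top-level string property; null when absent or not a string.
    string? Get(string name) =>
        root.TryGetProperty(name, out var v) && v.ValueKind == JsonValueKind.String
            ? v.GetString()
            : null;

    if (Get("identity") != "Microsoft.UI.Reactor.CSharp") problems.Add("identity");
    if (Get("groupIdentity") != "Microsoft.UI.Reactor") problems.Add("groupIdentity");
    // Substring sweep: catches the misspelling anywhere in the file.
    if (json.Contains("Micrsoft", StringComparison.Ordinal)) problems.Add("brand typo");
    if (Get("shortName") != "reactorapp") problems.Add("shortName");
    return problems;
}

var broken = "{\"identity\":\"Micrsoft.UI.Reactor.CSharp\",\"groupIdentity\":\"Micrsoft.UI.Reactor\",\"shortName\":\"reactorapp\"}";
Console.WriteLine(string.Join(", ", ValidateTemplateJson(broken)));
```

Because the checks read the file's content directly, they do not depend on the state of any template cache, which is the gap the install-based integration test left open.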
Workstation cache drained + reinstalled: `dotnet new uninstall
Microsoft.UI.Reactor.ProjectTemplates` repeated until empty, then
`mur pack-local` repacked against the fixed template, then
`dotnet new install` reinstalled. ~/.templateengine cache now carries
exactly one canonical "Microsoft.UI.Reactor.CSharp" entry across both
SDK versions on disk (10.0.104, 10.0.203).
Existing tests unaffected: Reactor.Tests 7179 passing / 46 expected
skips (up from 7175, +4 from the new template-metadata tests).
CreateTemplateTests integration smoke (`dotnet new reactorapp` + build
+ run + UI Automation find) passes 2/2 with the corrected identity.
EC3 verdict implication: both arms hit the typo equally, so the
relative deltas (calc -5.2%, kanban +14.9%) are not biased *by this
bug*. Absolute costs are inflated on every run; the long-tail outliers
(variant r5 = 889K tokens, base r1 = 1.12M tokens) likely had their
trajectories pushed further by `dotnet-new` thrash. The PASS-with-
caveats verdict still stands directionally; a re-run with the typo
fixed could materially shift the numbers in either direction. Spec doc
updated to reflect this.
Two harness-side mitigations deferred to separate follow-ups (the
source typo is the load-bearing fix; without it the harness mitigations
would still leak):
1. `dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplates`
before `dotnet new install --force` in eval setup, so future
typo-equivalent bugs can't accumulate.
2. Propagate inner-command exit codes into the PowerShell tool
wrapper's `success` field so `failedToolCalls` stops lying about
dotnet-new failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EC3-post-typo-fix smoke trace analysis identified two reads following the scaffold: `view App.cs` (essential: the file the agent is about to apply_patch) and `view <project>.csproj` (defensive: the agent checking that the scaffold produced a sane csproj). The .csproj read is informational at best: the scaffold's stdout already showed the file listing plus "Restore succeeded.", and a calc/kanban-shaped task never modifies the .csproj.
Across the prior 10 variant runs, calc averaged 2.2 views/run and kanban averaged 2.2 (r5's 4 reads pulling the kanban mean up). Two views post-scaffold is the modal pattern, so a one-line skill note aimed at the defensive read should noticeably reduce that count.
Added the same one-line note in two places so both skill consumers see it:
- plugins/reactor/skills/reactor-getting-started/SKILL.md right after the canonical .csproj block (line ~102, next to the WindowsPackageType / UseWinUI MUST rules).
- SKILL.md (top-level, packed into the nupkg) right after the matching csproj block in the Project Setup section.
The wording explicitly carves out App.cs as still-necessary so the note doesn't suppress useful reads.
Estimated savings: one view + a few hundred tokens per scaffold step. Small per-run, real across the batch since every eval scaffolds.
Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled agentkit/SKILL.md carries the update.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anti-probe + mur check pointer
Three related changes addressing the post-scaffold agent-confusion pattern
the EC3-post-typo-fix trace analysis identified:
1. tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj
removes <ImplicitUsings>enable</ImplicitUsings> and the three <Using
Include="..."> items. Reactor's namespaces are now explicitly
`using`-imported at the top of App.cs:
using Microsoft.UI.Reactor;
using Microsoft.UI.Reactor.Core;
using static Microsoft.UI.Reactor.Factories;
Why: with implicit usings on, the source file looks like it's missing
namespace context — `VStack`, `Heading`, `Component` appear unqualified
without a visible `using`, which confuses agents reasoning about where
symbols come from. The agent has to read the csproj to find the global
Using items, then mentally merge them into App.cs's namespace scope.
Explicit usings make App.cs self-contained: every symbol's source is
one of the three using directives at the top of the file. The skill
text now says "App.cs has its own using directives at the top, which is
the only place you add new namespaces" — which is true after this
change.
2. SKILL.md + plugins/reactor/skills/reactor-getting-started/SKILL.md
expand the existing "trust the scaffolded .csproj" note into an
anti-probe paragraph that enumerates the exact post-scaffold file
list: "the workspace contains exactly two source files: App.cs (entry
point + initial component) and <Name>.csproj. There is no Program.cs
and no GlobalUsings.cs — modify App.cs in place."
Why: the eval orchestrator's trace analysis identified a recurring
"agent probes for files that don't exist" pattern (sometimes asking
for Program.cs, sometimes inspecting obj/GlobalUsings.g.cs). Pinning
the file list in the skill is a one-paragraph fix.
3. Same two SKILL.md files add a 1-paragraph mur check pointer alongside
the anti-probe note: "Verify your edits with mur check before
declaring done... For anything more involved than the build/fix loop —
strict-mode failures, custom diagnostic gating, MSBuild passthrough
flags — load the reactor-build-and-check skill."
Why: the deeper reactor-build-and-check skill is a heavy load (full
--strict / --final / --quiet / --emit-threshold / --suppress-error
surface plus the iter/final framing). Most agent runs just need the
basic loop. Promoting mur check into getting-started with a one-liner
for the basic case lets the agent stay in the lighter skill until
they actually hit advanced behavior.
Verified: dotnet new reactorapp -n X builds clean in both the top-level-
program default and the --use-program-main true variant. Existing
CreateTemplateTests integration smoke (2/2) and TemplateMetadataTests
unit tests (4/4) pass. mur pack-local refreshes both nupkgs against the
new template + skill content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit (94d563f) added three usings to App.cs (Microsoft.UI.Reactor, .Core, static Factories) so the source is self-contained after dropping <ImplicitUsings>. The skill's "Required imports" section documents the full *canonical* set as five-plus-one, adding Microsoft.UI.Reactor.Layout, Microsoft.UI.Xaml, and Microsoft.UI.Xaml.Controls to the minimum three. The template and the skill had therefore diverged: the agent reading App.cs would see three usings while the skill text says the canonical set is six.
Sync the template to the skill: App.cs now ships all five-plus-one using directives, with the same `// FlexDirection, FlexJustify, ...` inline comments the skill uses for each non-obvious namespace. The starter App.cs still only uses three of them (Reactor, Core, static Factories); the other three are there because the agent will reach for them within the first ~5 turns of any real app (alignment enums, InfoBarSeverity, FlexDirection).
Updated the SKILL.md anti-probe paragraph in both copies to point at `using System.Linq;` as the example of "when you add a new namespace, add it to App.cs's using block". System.Linq is a real common add and isn't in the canonical six, so the example stays accurate. The top-level SKILL.md also explicitly names the canonical set so readers can cross-reference without flipping to the imports section.
Verified: dotnet new reactorapp -n X builds clean in the default variant. CreateTemplateTests integration smoke 2/2 and TemplateMetadataTests 4/4 pass against the expanded usings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With ImplicitUsings disabled (template commit 94d563f) the agent doesn't get System auto-imported. Common BCL surface (Action, Func, EventArgs, DateTime, Math, TimeSpan, Random) all lives there, and it shows up within the first few turns of any non-trivial app (event handlers, timers, randomization, formatting). Adding `using System;` to the template's App.cs eliminates the "Action does not exist in the current context" miss that's otherwise the first thing the agent hits when it authors an event handler.
Synced the canonical set in three places so they stay coherent:
- tools/Templates/templates/WinUIApp-CSharp/App.cs (scaffold output)
- plugins/reactor/skills/reactor-getting-started/SKILL.md "Required imports" code block
- SKILL.md anti-probe note's parenthetical canonical-set list
Verified: scaffolded App.cs ships `using System;` at the top of the canonical seven-line using block; default-variant build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five surgical removals identified by the analysis pass:
A. "Minimal app — single file" block (36 lines → 5). The single-file
`dotnet run App.cs` flow is a side path now that `dotnet new
reactorapp` is the primary entry; the canonical-shape teaching
moved into the scaffolded App.cs. Kept a 1-paragraph pointer to
reactor-build-and-check's single-file-scripts section for the
demo case.
B. Standalone `.csproj` xml block (17 lines dropped). The xml taught
the agent how to write a csproj from scratch — but the agent
doesn't author one. `dotnet new reactorapp` produces it. Kept the
"when to use a .csproj" framing + the WindowsPackageType /
UseWinUI MUST-rules + the recently-added anti-probe + mur check
paragraphs.
C. "Mode detection — selfhost vs. NuGet consumer" section (29 lines
→ 2). The top-level SKILL.md already owns selfhost/consumer
bootstrap; re-explaining it here was a second copy. The new
one-paragraph "Bootstrap" section breadcrumb-points readers to
SKILL.md and keeps the load-bearing `mur pack-local` recovery
tip inline.
D. "App entry point" section (13 lines → 0). The ReactorApp.Run<App>
form is already in the scaffolded App.cs. The unique content was
the inline-render-function form `ReactorApp.Run("T", ctx => ...)`
— embedded that as a one-line addendum to §Components instead of
carrying a whole section for it.
E. "Where the skill content comes from" package-cache directory tree
(6 lines dropped). The literal `%USERPROFILE%\.nuget\...` block
was reference material an agent can `find` on demand. Kept the
plugin-channel framing + the api-index pointer + the "read once,
cache in working memory" tip.
What's preserved unchanged:
- The React→Reactor table (highest-value block in the file)
- Components / Hooks / Common factories / Theme tokens / Critical
gotchas (load-bearing reference content)
- The recent anti-probe + mur check paragraphs
- The trimmed sections still carry their breadcrumb pointers so
agents looking for the removed content find their way to the
right skill (reactor-build-and-check, top-level SKILL.md, etc.)
Tier 2 (move drag-and-drop to reactor-input, trim Context, drop
duplicate List/UseReducer callout) and Tier 3 (move ContentDialog +
Flyout to reactor-recipes) are follow-up considerations, not applied
in this commit — they want eval validation before landing.
Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled
agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md
carries the trimmed file (verified: nupkg copy is 415 lines, matches).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append the EC3-final results subsection to the implementation tasks doc; mark EC3-original as superseded (preserved as the historical record of the typo-contaminated batch and the PASS-with-caveats reasoning that drove the watch-item triage work).
EC3-final headline numbers (5×N landed 2026-05-12 on eval/spec-038-ec3-2026-05-11 @ 053afe9):
- calc: tokens −33.7% mean / −37.1% median, cost −25.6%, turns −2.2, CV 28.4%, first-build 5/5
- kanban: tokens −21.2% mean (median +31.7% is the distribution-tightening artifact, not a regression: base CV 74% bimodal vs variant CV 19.5% no-fat-tail), cost −25.7%, turns 0, first-build 5/5
The 4× kanban CV improvement is the load-bearing finding: second batch in a row (after EC1-RR) where the predictability-as-a-feature signal shows up, first batch where calc also tightens.
All four EC3 pass criteria cleared. Spec §12's "~−$0.70 per run" prediction is comfortably exceeded on kanban ($1.08) and roughly met on calc ($0.66). Spec EC3 row's "~−2 turns" prediction hits calc exactly.
One unresolved footnote: per-rule firing counts weren't broken out in this batch. EC3-original was 0/10 on the three new Class-A rules; this clean PASS may be carried entirely by the structural fixes + template + skill changes with the three new rules still inert. The verdict supersedes EC3-original regardless (rules are correct in isolation, pass bars #1-#4 + #6, don't actively harm when silent), but the targeted-prompt batch at C:\temp\mur-targeted-prompt-spec.md remains the load-bearing follow-up for getting empirical token-impact numbers on those three rules specifically.
Watch-items carried into V1 / Phase 4 review:
- Class-A rule exercise via targeted-prompt batch
- §11 risk-row guardrail retrofit (post-run mur check --final audit)
- Tier-2 SKILL.md trims (now empirically de-risked)
- rule_fired trace event addition
Verdict: PASS, clean. Phase 3 V1 cleared to ship. PR #250 updated with the same verdict in its body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR advances mur check’s “did-you-mean” engine (spec 038 Phase 3) by adding three new Class‑A Tier‑3 rules, fixing two production-blocking correctness gaps (ProjectReference reference resolution + Tier‑3 suggest-gate carve-out), and repairing/streamlining the dotnet new reactorapp template and associated skill/docs/test coverage.
Changes:
- Add three new Tier‑3 Class‑A induced rules (GridSize parens, GridSize Px rename, TextBlock Style hint) plus fixture tests and a perf-bound test.
- Fix real-world rule execution by resolving ProjectReference outputs from `project.assets.json` and by gating Tier‑2 only (Tier‑3 rules always run when their codes surface).
- Fix template identity typo and adjust the template shape (drop implicit/global usings; add explicit `using` block in `App.cs`), plus skill/docs/changelog updates.
| File | Description |
|---|---|
| tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj | Removes implicit/global usings from the scaffolded project. |
| tools/Templates/templates/WinUIApp-CSharp/App.cs | Adds explicit using directives for the scaffolded starter app. |
| tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json | Fixes template identity/groupIdentity typo (Micrsoft → Microsoft). |
| tests/Reactor.Tests/TemplateMetadataTests.cs | Adds unit tests guarding template metadata/branding invariants. |
| tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorRuleTests.cs | Adds tests asserting Tier‑3 rules still fire when Tier‑2 is suggest-gated off. |
| tests/Reactor.Tests/CheckCommandTests/Rules/TextBlockStyleHintRuleTests.cs | Fixture tests for TextBlockStyleHintRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs | Adds perf bound test for per-rule BestMatch cost. |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizePxRenameRuleTests.cs | Fixture tests for GridSizePxRenameRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizeFactoryParensRuleTests.cs | Fixture tests for GridSizeFactoryParensRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs | Adds regression test for resolving ProjectReference-built DLLs via assets.json. |
| src/Reactor.Cli/Check/SuggesterOrchestrator.cs | Introduces tier2Enabled gating (Tier‑2 only) while always allowing rules. |
| src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs | Updates rule header docs to reflect Class‑A evidence/reclassification. |
| src/Reactor.Cli/Check/Rules/TextBlockStyleHintRule.cs | New Tier‑3 rule for missing TextBlockElement.Style patterns. |
| src/Reactor.Cli/Check/Rules/GridSizePxRenameRule.cs | New Tier‑3 rule mapping legacy Pixel/Pixels/Fixed → Px. |
| src/Reactor.Cli/Check/Rules/GridSizeFactoryParensRule.cs | New Tier‑3 rule for GridSize.<property>() CS1955 parens removal. |
| src/Reactor.Cli/Check/CompilationLoader.cs | Resolves ProjectReference outputs by scanning libraries entries in assets.json. |
| src/Reactor.Cli/Check/CheckCommand.cs | Loads compilation once and wires tier2Enabled through the orchestrator. |
| skills/reactor.api.txt | Updates API index content (new surfaced APIs). |
| SKILL.md | Updates top-level skill guidance (anti-probe + mur check workflow notes). |
| plugins/reactor/skills/reactor-getting-started/SKILL.md | Trims/reshapes getting-started skill and synchronizes scaffold/import guidance. |
| plugins/reactor/skills/reactor-dsl/references/reactor.api.txt | Updates packaged API index copy. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md | Adds cross-agent reproducibility audit writeup. |
| docs/specs/tasks/038-mur-check-did-you-mean-implementation.md | Updates spec task status/results narrative through EC3 findings. |
| docs/reference/mur-check-did-you-mean.md | Expands reference doc to cover Phase 2–3 behavior and fixes. |
| CHANGELOG.md | Records new rules and correctness fixes under Unreleased. |
Copilot's findings
- Files reviewed: 25/25 changed files
- Comments generated: 2
…on non-suggestable builds
Copilot review surfaced two substantive issues; both fixed.
(1) RulePerformanceTests CombinedStub only carried targets for the
three earlier Class-B rules. With RuleRegistry.Default now including
the three new Class-A rules (GridSizeFactoryParens, GridSizePxRename,
TextBlockStyleHint), those three were silently self-disabling during
the perf test — TargetsResolve failed against the stub's missing
GridSize and TextBlockElement types. The 'budget = 0.5ms × ruleCount'
assertion then scaled by registry.All.Length (six) while only
measuring three rules' actual cost, so the bound was 2× loose.
Fix: extended the stub with Microsoft.UI.Reactor.GridSize (record
struct with Auto/Star/Px matching the real shape) and
Microsoft.UI.Reactor.Core.TextBlockElement (record). Added a
stub-coverage guard at the top of the perf test that asserts every
rule in RuleRegistry.Default.All resolves its declared targets
against the test compilation — fails loudly with the missing target
name and rule name if someone adds a new rule without updating the
stub. Future-proofs the budget assertion.
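The stub-coverage guard from fix (1) amounts to a set-difference check. A minimal sketch, assuming rules and targets can be represented as plain strings; `MissingTargets` and the sample data are hypothetical, and the real guard resolves symbols against the test compilation instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Lists "rule: target" pairs for every declared target the stub cannot resolve.
List<string> MissingTargets(
    IReadOnlyDictionary<string, string[]> declaredTargetsByRule,
    ISet<string> resolvableTypes) =>
    declaredTargetsByRule
        .SelectMany(kv => kv.Value
            .Where(target => !resolvableTypes.Contains(target))
            .Select(target => $"{kv.Key}: {target}"))
        .ToList();

// Hypothetical data: a stub that predates the TextBlock rule.
var declared = new Dictionary<string, string[]>
{
    ["GridSizePxRenameRule"] = new[] { "Microsoft.UI.Reactor.GridSize" },
    ["TextBlockStyleHintRule"] = new[] { "Microsoft.UI.Reactor.Core.TextBlockElement" },
};
var stubTypes = new HashSet<string> { "Microsoft.UI.Reactor.GridSize" };

var missing = MissingTargets(declared, stubTypes);
Console.WriteLine(missing.Count == 0 ? "stub covers every rule" : string.Join("; ", missing));
```

Failing on a non-empty list before timing anything keeps the budget's rule count and the set of rules actually measured in step.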
(2) CheckCommand.Run unconditionally called
CompilationLoader.Instance.Load(path) after the EC3 gate carve-out
refactor, even when no parsed diagnostic could plausibly produce a
suggestion (no diagnostics at all; only Tier-2 codes with the gate
closed and no rule covering them; only nullable/XML-doc warnings).
The compilation load is 50–500 ms cold — .cs enumeration, file-set
hash, full reference resolution including the new ProjectReference
walk. Paying it on every clean mur check was wall-time regression
on the happy path.
Fix: added SuggesterOrchestrator.AnyDiagnosticIsSuggestable(diags,
tier2Enabled, rules) — flat scan over the (small) diagnostic list
against the union of Tier-2's SupportedCodes and every rule's
DiagnosticCodes. Microseconds. CheckCommand.Run now gates the
compilation load behind that pre-check: only loads when at least one
diagnostic could plausibly produce a suggestion.
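The pre-check described in fix (2) reduces to one pass over the diagnostic codes. A minimal sketch, assuming diagnostics are represented by their code strings; the signature differs from the real `AnyDiagnosticIsSuggestable(diags, tier2Enabled, rules)`, and the two code sets below are illustrative rather than the actual SupportedCodes or DiagnosticCodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// True when at least one diagnostic could produce a suggestion:
// a rule covers its code, or Tier-2 is enabled and supports its code.
bool AnyDiagnosticIsSuggestable(
    IEnumerable<string> diagnosticCodes,
    bool tier2Enabled,
    ISet<string> tier2SupportedCodes,
    ISet<string> ruleDiagnosticCodes) =>
    diagnosticCodes.Any(code =>
        ruleDiagnosticCodes.Contains(code) ||
        (tier2Enabled && tier2SupportedCodes.Contains(code)));

// Illustrative code sets (assumptions, not the shipped lists).
var tier2Codes = new HashSet<string> { "CS1061" };
var ruleCodes = new HashSet<string> { "CS1955" };

// Only nullable warnings: nothing to suggest, so the compilation load can be skipped.
Console.WriteLine(AnyDiagnosticIsSuggestable(new[] { "CS8602", "CS8618" }, true, tier2Codes, ruleCodes));
// A rule-covered code stays suggestable even with the Tier-2 gate closed.
Console.WriteLine(AnyDiagnosticIsSuggestable(new[] { "CS1955" }, false, tier2Codes, ruleCodes));
```

The scan only touches in-memory sets, so it costs far less than the compilation load it guards.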
Test coverage:
- RulePerformanceTests: stub-coverage guard asserts every
DeclaredTarget across RuleRegistry.Default.All resolves.
- SuggesterOrchestratorRuleTests gains 5 new facts:
* empty diag list → false (clean build skips load)
* unrelated CS warnings (CS8602/CS8618) → false
* CS1061 + tier2Enabled=true → true
* CS1061 + tier2Enabled=false + no rule → false (gate-closed
Tier-2-only path is non-suggestable)
* CS1955 covered by rule + tier2Enabled=false → true (Tier-3
always runs)
Verified:
- Reactor.Tests 7184 passing / 46 expected skips (was 7179, +5).
- CreateTemplateTests integration smoke 2/2.
- Clean wordpuzzle mur check exits with no output (pre-check
short-circuits — no compilation load).
- Wordpuzzle with GridSize.Pixel(80) + GridSize.Auto() injected:
both rules still fire under the default gate with full evidence
suffixes. Pre-check correctly identifies the build as
suggestable; nothing regressed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both Copilot CR comments addressed in 1. Fix: extended 2. Fix: added
Verification. Reactor.Tests 7184/46 (was 7179, +5 for the new pre-flight facts).
Use the modern Windows TitleBar (drag region, system menu, themed caption) as the top-of-window element and wrap content in a Border with 24px padding. Apply the same polish to the `mur --create` scaffolder so both entry points produce a presentable starter app. Align the scaffolder's emitted usings with the dotnet new template (PR #250) so generated apps have the common WinUI/Reactor namespaces ready to go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
- `GridSizeFactoryParensRule` (CS1955, 146 events combined, top freq in both corpora, first cross-tier rule), `GridSizePxRenameRule` (CS0117, 9 events), `TextBlockStyleHintRule` (CS1061/CS0117, 5 events across two syntactic shapes). `ThemeBackgroundSuffixRule` reclassified Class-B → Class-A in the file-header comment.
- `CompilationLoader` now resolves `ProjectReference` outputs from `project.assets.json`'s `libraries.<id>` entries with `type=project`. Without this, Reactor itself (a project reference for every sample app) was invisible to `RuleSymbolResolver` and every rule's `DeclaredTargets` failed, so the whole registry self-disabled on every real invocation.
- `SuggesterOrchestrator` takes `tier2Enabled: bool`; Tier-3 rules always run when their diagnostic code surfaces, Tier-2 stays gated. EC2 watch-item ("Phase-3 rules are the right lever — not Phase-2.x gate tuning") finally addressed in code.
- Template identity typo fixed (`Micrsoft.UI.Reactor.CSharp` → `Microsoft.UI.Reactor.CSharp` in `template.json`); it was breaking `dotnet new reactorapp` resolution against accumulating template caches and hit 20/20 runs across both arms of the EC3-original batch.
- Removed `<ImplicitUsings>enable</ImplicitUsings>` from the scaffolded csproj; explicit using directives (System + Microsoft.UI.Reactor + .Core + .Layout + Xaml + Xaml.Controls + static Factories) baked into `App.cs`. Every symbol now traces to a visible `using` at the top of the file. The starter App.cs only uses three of the seven imports; the rest are there for the namespaces the agent reaches for within the first few turns of any real app.
- SKILL.md (top-level and the `reactor-getting-started` plugin copy) gains the anti-probe + `mur check` pointer paragraphs that the EC3 trace analysis identified as load-bearing.
- `reactor-getting-started` Tier-1 trims (509 → 415 lines, −18%): dropped the single-file `dotnet run` minimal-app block, the standalone csproj xml, the Mode-detection section duplicated by top-level SKILL.md, the App-entry-point section, and the package-cache directory tree. No load-bearing content removed; all five cuts have breadcrumb pointers to where the displaced content lives.
- `docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md` closes Data Checkpoint C's reproducibility bar. Reference doc `docs/reference/mur-check-did-you-mean.md` expanded to cover Phase 2 + 3, the cross-agent mining methodology, the gate carve-out, and the ProjectReference fix. EC3-original (PASS-with-caveats) + EC3-final (clean PASS) results recorded in `docs/specs/tasks/038-mur-check-did-you-mean-implementation.md`.
Phase 3 V1 ship verdict
EC3-final clean PASS landed 2026-05-12. It supersedes the EC3-original PASS-with-caveats verdict captured under that batch's contaminated-substrate run. The clean batch ran on `eval/spec-038-ec3-2026-05-11` HEAD against the existing n=5 baseline.

The kanban-median +31.7% delta is a distribution-tightening story, not a regression story. The base kanban distribution was 263K–1,118K tokens (CV 74%) and bimodal: most runs sat near the floor, and one r1 blowout dragged the mean up while the median stayed artificially low at 304K. The variant kanban distribution is 261K–464K (CV 19.5%), with no fat tail and every run within 1.8× of best. The load-bearing finding is the roughly 4× CV improvement (74% → 19.5%), which is the predictability-as-a-feature signal the spec §11 risk row called out as deployable-workflow value (separate from any token-mean win). This is the second batch in a row (after EC1-RR) where this mechanism shows up, and the first batch where calc also tightens.
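To make the "median up, CV down" reading concrete, here is a minimal sketch of the arithmetic. The per-run numbers are hypothetical placeholders chosen only to mimic the shape described above (one blowout versus a tight cluster); they are not the EC3 per-run data. The use of the sample standard deviation (n−1) is also an assumption, since the estimator used for the reported CVs is not stated here.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(samples):
    """Return sample standard deviation divided by the mean (unitless).

    Assumption: sample (n-1) standard deviation; the eval tooling may
    use a different estimator.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    m = mean(samples)
    if m == 0:
        raise ValueError("mean is zero; CV is undefined")
    return stdev(samples) / m


# Hypothetical token counts in thousands, NOT the real eval runs.
base = [265, 280, 305, 330, 1100]     # most runs near the floor, one blowout
variant = [260, 300, 400, 430, 460]   # no fat tail

median_delta = (median(variant) - median(base)) / median(base)
cv_base = coefficient_of_variation(base)
cv_variant = coefficient_of_variation(variant)

print(f"median delta: {median_delta:+.1%}")
print(f"CV base: {cv_base:.1%}, CV variant: {cv_variant:.1%}")
print(f"CV ratio: {cv_base / cv_variant:.1f}x")
```

With a shape like this, the median rises even though the spread collapses, which is why a median-only comparison can read as a regression when the distribution has actually tightened.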
All four pass criteria cleared:
- `failedToolCalls` 0/0; §11 guardrail retrofit (post-run `mur check --final` audit) still deferred for high-confidence assertion

Full results table at `docs/specs/tasks/038-mur-check-did-you-mean-implementation.md` § "EC3-final results — 5×N landed 2026-05-12".

One footnote worth recording
EC3-original measured 0/10 firings on the three new Class-A rules (`GridSizeFactoryParensRule` / `GridSizePxRenameRule` / `TextBlockStyleHintRule`). EC3-final doesn't break out per-rule counts, so we can't say whether the clean-PASS win includes any contribution from those three rules or whether it's entirely the structural fixes + template + skill changes carrying the result. The clean PASS supersedes the EC3-original verdict regardless: the rules are correct in isolation, pass Validation Gate bars #1–#4 + #6, and don't actively harm when silent. But "Phase 3 V1 shipped on Class-A rules that may not have fired in production-ish eval" is a footnote worth recording for whoever picks up this work next. The targeted-prompt batch at `C:\temp\mur-targeted-prompt-spec.md` is the load-bearing follow-up for empirical token-impact numbers on the three Class-A rules specifically.

Test plan
- `dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -c Debug -p:Platform=x64`: 7179 passing / 46 expected skips
- `dotnet test tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj` with `CreateTemplateTests` filter: 2/2 passing on the corrected template identity
- `mur check --list-rules` shows all six rules `enabled` with zero self-disables against `samples/apps/wordpuzzle`
- With `--suggest-threshold 3`: inject `GridSize.Pixel(80)` + `GridSize.Auto()` → both rules fire with full evidence suffixes (gate carve-out verified live)
- `mur pack-local` against branch HEAD: `Microsoft.UI.Reactor.0.0.0-local.nupkg` carries the corrected template identity, explicit-usings `App.cs`, no implicit usings, and the trimmed `agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md`
- `~/.templateengine` cache drained of stale `Micrsoft.UI.Reactor.CSharp` entries and reinstalled clean

Surface area
- `src/Reactor.Cli/Check/Rules/{GridSizeFactoryParens,GridSizePxRename,TextBlockStyleHint}Rule.cs`
- `src/Reactor.Cli/Check/{CompilationLoader,SuggesterOrchestrator,CheckCommand}.cs`, `src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs`
- `RulePerformanceTests.cs` (§3.1a perf bound) + `TemplateMetadataTests.cs` (typo regression)
- `CompilationLoaderTests.cs`, `SuggesterOrchestratorRuleTests.cs`
- `tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json` (typo fix), `Company.ReactorApp1.csproj` (drop ImplicitUsings), `App.cs` (explicit usings)
- `SKILL.md`, `plugins/reactor/skills/reactor-getting-started/SKILL.md` (Tier-1 trims, anti-probe note, `mur check` pointer, canonical-usings sync)
- `docs/reference/mur-check-did-you-mean.md` (expanded through Phase 2 + 3 + cross-agent methodology); `docs/specs/tasks/038-mur-check-did-you-mean-implementation.md` (status snapshot, EC3-original + EC3-final results, cross-agent audit verdicts); new `docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md`
- `## [Unreleased]`

🤖 Generated with Claude Code